Abstract

The problem of assigning probabilities when little is known is analyzed
in the case where the quantities of interest are physical observables, i.e.
can be measured and their values expressed by numbers. It is pointed out
that the assignment of probabilities based on observation is a process of
inference, involving the use of Bayes' theorem and the choice of a
probability prior. When a lot of data is available, the resulting
probabilities are remarkably insensitive to the form of the prior. In the
opposite case of scarce data, it is suggested that the probabilities be
assigned such that they are the least sensitive to specific variations of
the probability prior. In the continuous case this results in a probability
assignment rule which calls for minimizing the Fisher information subject
to constraints reflecting all available information. In the discrete case,
the corresponding quantity to be minimized turns out to be a Rényi distance
between the original and the shifted distribution.



1 Introduction 

The problem of probability assignment has been stirring debate and
controversy ever since Laplace introduced the notion of indifference as an
argument in specifying prior distributions. He thus started a quest for a
statistical Holy Grail: prior distributions reflecting ignorance. Today, more
than two centuries later, a satisfactory solution to this problem is still
elusive. In what follows we offer a physicist's take on the somewhat narrower
problem of assigning probabilities for measurable quantities, or, as
physicists call them, observables. Strict space limitations force the
exposition to be much more concise than it should be, but hopefully the main
message will still come through.

Presented at MaxEnt 2012, the 32nd International Workshop on Bayesian Inference and
Maximum Entropy Methods in Science and Engineering, July 15-20, 2012, Garching near
Munich, Germany.



2 Probabilities as opinions: an objective take on 
subjectivity 

When we state "A has a probability p of being true", what we really mean is "We
don't know whether A is true or false, yet we believe that, if our world existed
together with a number of its replicas, A would be true in pN out of N of them
when $N \to \infty$". Now, it should be evident that, because of the implied limit
procedure, there is no practical way of verifying this statement. One cannot
possibly reproduce a given physical situation down to its most minute details
several times, let alone an infinite number of times - hence "we believe".
Without this leap of faith no rational science would be possible. An example of
this sort of belief can be found in Mechanics - we know that material points do
not exist, but we believe that if they did, they would behave according to
Newton's First, Second and Third laws. The source of our faith in this case is
countless observations of the behavior of real objects from afar. It is practice
that sorts out "good" from "bad" beliefs. Different people, however, have
different experiences, so beliefs are subjective and may differ significantly
from one person to another. It is, therefore, of significant interest to inquire
what it is that makes it possible for rational agents to agree among themselves
on what exactly they observe. To that end, let us try to walk in the footsteps of
Laplace's "inverse probabilities" in analyzing how opinions are formed from
observations. The following builds on [2].

2.0.1 The Anatomy of a Measurement 

For our purposes, we shall simplistically call "a measurement" a well-defined
procedure to put a real number in correspondence with a physical phenomenon.
Usually we have a good idea what the range $\mathcal{R} = [a, b]$ of this number
is, but the practicalities of the particular procedure prevent it from being
precise. Thus, instead of a real number $x \in [a, b]$ the outcome of a single
measurement is rather a pointer (index) $i$ to a subinterval $D_i \subset [a, b]$,
where $[a, b] = \bigcup_{j=1}^{n} D_j$ and $D_i \cap D_{j \neq i} = \emptyset$.
Repeating$^{1}$ the measurement $m$ times we end up with a histogram of $n$ bins
where each bin $i$ contains $s_i$ - the number of times the measurement fell in
that bin. Obviously, $\sum_{i=1}^{n} s_i = m$. Now, if a result in bin $i$ had an
assigned probability $p_i$ in a single measurement, probability theory teaches us
that the probability of a set $\{s\}$ is of the multinomial form
$P(s|p) = \frac{m!}{\prod_j s_j!}\, p_1^{s_1} p_2^{s_2} \cdots p_n^{s_n}$. We,
however, are interested in the opposite situation - the results of the
measurements $\{s\}$ are known, and we want to assign probabilities. In this case
we recognize $P(s|p)$ as the likelihood and apply Bayes' theorem to obtain the
probability of an assignment $\{p\}$ given the measurements $\{s\}$

$$P(p|s) = \mathcal{N}^{-1}\, p_1^{s_1} p_2^{s_2} \cdots p_n^{s_n}\, \pi(p)\, \delta\Big(\sum_{i=1}^{n} p_i - 1\Big) \qquad (1)$$

1 "Repeating" here is a misnomer - what is meant is an "ensemble" of replicas of the world
with one measurement performed in each of its members.
where $\mathcal{N} = \int d^n p\; p_1^{s_1} p_2^{s_2} \cdots p_n^{s_n}\, \pi(p)\, \delta\big(\sum_{i=1}^{n} p_i - 1\big)$
is a normalization factor. $\pi(p)$ is a probability prior which originates in
whatever knowledge we have about the phenomenon in question, the measuring
procedure and the structure of the domain's decomposition
$[a, b] = \bigcup_{j=1}^{n} D_j$. For example, one might find it reasonable to
assign prior probability proportional to the measure (length) of $D_i$, etc.
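
As a minimal sketch of the two ingredients introduced so far, the following Python fragment bins a handful of readings and evaluates the multinomial likelihood $P(s|p)$. The readings, the choice of four equal-width bins and the function names are assumptions made purely for illustration; they are not taken from the text.

```python
from math import factorial, prod

def bin_counts(values, a, b, n):
    """Histogram `values` from [a, b] into n equal-width bins; returns s_1..s_n."""
    counts = [0] * n
    width = (b - a) / n
    for x in values:
        # A reading exactly at the right edge b is placed in the last bin.
        i = min(int((x - a) / width), n - 1)
        counts[i] += 1
    return counts

def multinomial_likelihood(s, p):
    """P(s|p) = m! / (s_1! ... s_n!) * p_1^s_1 ... p_n^s_n."""
    m = sum(s)
    coeff = factorial(m) // prod(factorial(sj) for sj in s)
    return coeff * prod(pj ** sj for pj, sj in zip(p, s))

readings = [0.12, 0.40, 0.47, 0.81, 0.95]   # hypothetical measurements in [0, 1]
s = bin_counts(readings, 0.0, 1.0, 4)        # counts per bin
likelihood = multinomial_likelihood(s, [0.25] * 4)
print(s, likelihood)
```

The single-measurement outcome is only ever the bin index, which is why the likelihood depends on the readings through the counts alone.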



2.0.2 The Role of the Probability Prior 



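With Eq. (1), the most natural way to assign the individual probabilities is as
the expectations

$$\langle p_k \rangle = \mathcal{N}^{-1} \int d^n p\; p_1^{s_1} p_2^{s_2} \cdots p_k^{s_k + 1} \cdots p_n^{s_n}\, \pi(p)\, \delta\Big(\sum_{i=1}^{n} p_i - 1\Big) \qquad (2)$$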



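where the integration is over the unit hypercube $p_i \in [0, 1]$. For a uniform
prior $\pi(p) = 1$ the integration [1] produces
$\langle p_k \rangle = \frac{s_k + 1}{m + n} = \frac{m}{m + n}\left(f_k + \frac{1}{m}\right)$,
where $f_i = s_i/m$ are the "sample frequencies". The variances of this assignment
are easily calculated to be
$\langle (\Delta p_k)^2 \rangle = \frac{1}{m + n + 1} \langle p_k \rangle (1 - \langle p_k \rangle)$.
For a different prior - uniform on a quadrant of the hypersphere defined by
$p_i = u_i^2$ - the integrals have been evaluated in [2] as
$\langle p_k \rangle = \frac{m}{m + n/2}\left(f_k + \frac{1}{2m}\right)$ and
$\langle (\Delta p_k)^2 \rangle = \frac{1}{m + 1 + n/2} \langle p_k \rangle (1 - \langle p_k \rangle)$.
For a general prior we use the mean value theorem from Analysis to obtain

$$\langle p_k \rangle = \langle p_k \rangle_0\, \frac{\pi(\xi'_k)}{\pi(\xi)}, \qquad
\langle (\Delta p_k)^2 \rangle = \frac{\pi(\xi''_k)}{\pi(\xi)}\, \langle (\Delta p_k)^2 \rangle_0
+ \left[ \frac{\pi(\xi''_k)}{\pi(\xi)} - \frac{\pi^2(\xi'_k)}{\pi^2(\xi)} \right] \langle p_k \rangle_0^2$$

where $\xi$, $\xi'_k$ and $\xi''_k$ are points in the unit hypercube close to the
maxima of $\prod_{i=1}^{n} p_i^{s_i}\, \delta(1 - \sum_{i=1}^{n} p_i)$,
$p_k \prod_{i=1}^{n} p_i^{s_i}\, \delta(1 - \sum_{i=1}^{n} p_i)$ and
$p_k^2 \prod_{i=1}^{n} p_i^{s_i}\, \delta(1 - \sum_{i=1}^{n} p_i)$,
correspondingly, and the zero-subscript quantities are those corresponding to
the uniform prior. Assuming abundance of data (large $s_i$ and, correspondingly,
$m$) and a smooth prior, it can be shown that $\xi'_k - \xi \approx n_k/m$ and
$\xi''_k - \xi \approx 2 n_k/m$, where $(n_k)_j = \delta_{kj}$. Hence, expanding to
the lowest non-trivial order of $1/m$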



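$$\langle p_k \rangle = \langle p_k \rangle_0 \left[ 1 + \frac{1}{m} \frac{\partial_k \pi}{\pi} \right] + O\!\left(\frac{1}{m^2}\right)$$

$$\langle (\Delta p_k)^2 \rangle = \langle (\Delta p_k)^2 \rangle_0 \left[ 1 + \frac{2}{m} \frac{\partial_k \pi}{\pi} \right]
+ \frac{\langle p_k \rangle_0^2}{m^2} \left[ \frac{\partial_k^2 \pi}{\pi} - \frac{(\partial_k \pi)^2}{\pi^2} \right] + O\!\left(\frac{1}{m^3}\right)$$

where $\partial_k \pi \equiv \partial \pi / \partial p_k$ and the prior and its
derivatives are evaluated at $\xi$.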



Thus, we recognize that the arbitrariness of the probability prior induces
multiplicative noise in the assigned probabilities, and affects their variances
both by rescaling and shifting. It is also worth noting that the only instance of
assigning zero probability would be due to the choice of the prior; measurements
alone, no matter how numerous, cannot force us to assign strictly vanishing
probabilities.
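
The insensitivity to the prior for abundant data can be illustrated numerically. The sketch below compares the two closed-form assignments quoted above, treating the uniform prior and the hypersphere prior as symmetric Dirichlet priors with parameters 1 and 1/2 respectively; that identification, the sample frequencies and the function names are assumptions of this example.

```python
def posterior_mean(counts, alpha):
    """Posterior expectation <p_k> for a symmetric Dirichlet(alpha) prior.

    alpha = 1   gives the uniform-prior result (s_k + 1) / (m + n);
    alpha = 1/2 gives the hypersphere-prior result (s_k + 1/2) / (m + n/2).
    """
    m, n = sum(counts), len(counts)
    return [(s + alpha) / (m + n * alpha) for s in counts]

def max_relative_gap(counts):
    """Largest relative difference between the two assignments over all bins."""
    uniform = posterior_mean(counts, 1.0)
    sphere = posterior_mean(counts, 0.5)
    return max(abs(u - v) / u for u, v in zip(uniform, sphere))

frequencies = [0.5, 0.3, 0.2]                 # hypothetical sample frequencies f_k
for m in (10, 100, 1000, 10000):
    counts = [round(f * m) for f in frequencies]
    print(m, max_relative_gap(counts))
```

The printed gap shrinks roughly in proportion to $1/m$, in line with the leading correction term of the expansion.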

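In the other extreme - no data available ($m = 0$) - the probability assignment
derives through Eq. (2) strictly from the prior:

$$\langle p_k \rangle = \mathcal{N}^{-1} \int d^n p\; p_k\, \pi(p)\, \delta\Big(\sum_{l=1}^{n} p_l - 1\Big) = \overline{p_k}^{\,\pi}$$

For one performed measurement ($m = 1$) that produced a result in bin $i$

$$\langle p_k \rangle = \mathcal{N}^{-1} \int d^n p\; p_i\, p_k\, \pi(p)\, \delta\Big(\sum_{l=1}^{n} p_l - 1\Big) = \frac{\overline{p_k p_i}^{\,\pi}}{\overline{p_i}^{\,\pi}}$$

and analogously for higher values of $m$; the overline with superscript $\pi$
denotes an average over the prior. Probabilities are most useful when
little or no data is available, and it is seen that such "ignorance" probability
assignments for measurable quantities are, not surprisingly, entirely determined
by the choice of the prior $\pi(p)$.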
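
To make the $m = 0$ and $m = 1$ cases concrete, the following sketch evaluates both expressions exactly for a Dirichlet prior, whose first and second moments are known in closed form. The Dirichlet family and the parameter values are assumptions chosen for illustration; the text itself leaves $\pi(p)$ general.

```python
from fractions import Fraction

def prior_mean(alpha, k):
    """<p_k> with no data: the prior expectation alpha_k / alpha_0."""
    return Fraction(alpha[k], sum(alpha))

def prior_second_moment(alpha, k, i):
    """Prior expectation of p_k * p_i for a Dirichlet(alpha) prior."""
    a0 = sum(alpha)
    extra = 1 if k == i else 0
    return Fraction(alpha[k] * (alpha[i] + extra), a0 * (a0 + 1))

def mean_after_one_result(alpha, k, i):
    """<p_k> after a single measurement landing in bin i: <p_k p_i> / <p_i>."""
    return prior_second_moment(alpha, k, i) / prior_mean(alpha, i)

alpha = [1, 2, 3]   # hypothetical integer Dirichlet parameters of the prior
before = [prior_mean(alpha, k) for k in range(3)]
after = [mean_after_one_result(alpha, k, 0) for k in range(3)]
print(before, after)
```

With one result in the first bin the assignment moves from (1/6, 1/3, 1/2) to (2/7, 2/7, 3/7): the single datum shifts the numbers, but their overall shape is still set by the prior.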

An interesting result is obtained when we go to the continuum limit $n \to \infty$.
With $p_k = \int_{D_k} dx\, p(x) = \int_{x_k}^{x_k + \Delta x_k} dx\, p(x) = \Delta x_k\, p(x_k) + \frac{1}{2} \Delta x_k^2\, p'(x_k) + \cdots$,
the usual identification $p_k = \Delta x_k\, p(x_k)$ for $\Delta x_k \to 0$ only makes
sense when the probability density $p(x)$ is everywhere differentiable in
$[a, b]$. In order to avoid handling ugly continual integrals, we perform the
$n \to \infty$ limit at the stage where, with $\mu_k = \pi(\xi'_k)/\pi(\xi)$ and
$\sigma(x_k) = \lim_{n \to \infty} \frac{1}{n \Delta x_k}$,

$$\langle p(x_k) \rangle = \mu(x_k) \lim_{n \to \infty} \frac{\langle p_k \rangle_0}{\Delta x_k}
= \mu(x_k) \lim_{n \to \infty} \left[ \frac{m}{m + n}\, f(x_k) + \frac{n}{m + n}\, \sigma(x_k) \right]$$



We observe that, for any finite amount of data ($m < \infty$) the assigned
probability density $\langle p(x) \rangle = \mu(x)\sigma(x)$ depends on the metric
$\sigma$ and the prior but not on the data, while for $m = \infty$ the result
depends on the order in which the limits are taken. Only for $m \to \infty$
before $n \to \infty$ is the result proportional to the "sample frequency"
density $f(x)$.

To summarize, in order to relate probabilities (opinions) to the real world
(sample frequencies), we need the help of Bayes' theorem, where a probability
prior enters the game. Hence, even when a lot of data is available, the
probability assignments are not unambiguous - the arbitrariness of the prior
manifests itself as multiplicative noise in the probabilities and in their
variances. When little or no data is available the assignments derive directly
from the chosen prior. Let us also emphasize an important lesson from the above:
the widely held opinion that a probability distribution represents a "state of
knowledge" is wrong. It is rather the sample frequencies, coming from
observations, which constitute "knowledge". Probabilities are necessarily
inferred, and thus represent only a "state of belief". The importance of this
subtle distinction will become apparent in what follows.
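
The continuum-limit observation made earlier in this section can be checked numerically. In the sketch below a fixed set of readings is binned ever more finely under a uniform prior, and the posterior-mean probability of a fixed interval is tracked. Equal-width bins, the data set and the function name are assumptions of the example.

```python
def interval_mass(data, a, b, n, lo, hi):
    """Posterior-mean probability of [lo, hi) under a uniform prior on n equal bins.

    Sums <p_k> = (s_k + 1) / (m + n) over the bins whose left edge lies in [lo, hi).
    """
    m = len(data)
    width = (b - a) / n
    counts = [0] * n
    for x in data:
        counts[min(int((x - a) / width), n - 1)] += 1
    mass = 0.0
    for k in range(n):
        left = a + k * width
        if lo <= left < hi:
            mass += (counts[k] + 1) / (m + n)
    return mass

# Hypothetical data set: 20 readings, all in the left half of [0, 1].
data = [0.025 * j for j in range(20)]
for n in (10, 100, 10000, 100000):
    print(n, interval_mass(data, 0.0, 1.0, n, 0.0, 0.5))
```

Although every reading lies in the left half, the probability assigned to that half drifts from 5/6 towards 1/2 as the binning is refined, because the weight $m/(m+n)$ carried by the data vanishes for fixed $m$.
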
3 Assigning Probabilities



The most intellectually appealing scheme for assigning probabilities, in our
opinion, was put forward by Jaynes around the middle of the last century, under
the name "Maximum Entropy" (MaxEnt) principle. It is very difficult for a
rational person to argue with its general formulation, which simply calls for
inferential coherence by prescribing the assignment of the least committed
probability distribution consistent with all available information. However,
opinions rapidly start to diverge when it comes to specifying how exactly the
"least committed" distribution is defined and what exactly constitutes
"available information". On the first point, Jaynes himself maintained that the
"least committed" distribution is the one with maximal Shannon entropy. His, and
many others', affinity for Shannon's entropy was based on a number of appealing
properties it possesses. Over the years a tremendous amount of effort was
invested into trying to prove that it is "the one and only" reasonable criterion
to use. Eventually, however, two things were, or should have been, understood:
1) Shannon's entropy is but a particular instance of a larger class of equally
reasonable Rényi entropies; and 2) the use of Jaynes' procedure as a probability
assignment rule is untenable, so it gradually evolved into a probability
updating rule - leaving us where we started, with the necessity of assigning an
ignorance prior. On the second point, the available information is most often
presented as a number of prescribed expectation values. Jaynes himself was aware
of the conflict between the expectations being characteristics of probability
distributions, and as such essentially opinions, and actual information obtained
by measurements, but he took the position that the available information entered
in the form of constraint(s) on the probability distribution even if "It might
... be only the guess of an idiot" [3]. Before we embark on the ambitious task of
trying to clarify these points, let us briefly address the question of "once
assigned, how can probabilities be used?".
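
The statement that Shannon's entropy is one member of the Rényi family can be illustrated with a short numerical check: the Rényi entropy of order $\alpha$ approaches the Shannon value as $\alpha \to 1$. The distribution and function names below are hypothetical and serve only as a sketch.

```python
from math import log

def shannon_entropy(p):
    """H(p) = -sum_i p_i ln p_i (natural units), skipping empty bins."""
    return -sum(pi * log(pi) for pi in p if pi > 0)

def renyi_entropy(p, alpha):
    """H_alpha(p) = ln(sum_i p_i^alpha) / (1 - alpha), for alpha > 0, alpha != 1."""
    return log(sum(pi ** alpha for pi in p if pi > 0)) / (1.0 - alpha)

p = [0.5, 0.3, 0.2]   # hypothetical distribution
for alpha in (0.5, 0.9, 0.999, 1.001, 2.0):
    print(alpha, renyi_entropy(p, alpha))
print("Shannon", shannon_entropy(p))
```

Each order $\alpha$ ranks distributions by a different notion of "spread", which is one way to see why singling out the Shannon case requires an additional argument.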

3.0.3 What Use are Opinions? 

Probabilities being subjective, it is not immediately obvious how practical use
can be made of them. In a statistical sense, probabilities are the best
estimators of sample frequencies, and this is about the only guiding principle
for their use. Hence, it appears that plugging probabilities in place of sample
frequencies in various statistical estimators would allow us to infer
predictions about the results of measurements not yet performed. Such
statistical estimators are the Kolmogorov-Nagumo averages [4], defined as
$\langle A \rangle_\phi = \phi^{-1}\big(\sum_i p_i\, \phi(A_i)\big)$, where
$\phi(x)$ is a continuous and strictly monotonic function, $A$ is an observable,
and $A_i$ is the value of $A$ corresponding to bin $i$. Different functions
$\phi$ in general produce different values of $\langle A \rangle_\phi$. When
measuring physical observables, we can use rulers with different units and
different origins of the scale. Unless the predictions for the results of
measurements behave appropriately under such rescalings and shifts, they would
be useless. Therefore, an important requirement to be imposed on a useful
estimator is that
$\langle \alpha A + \beta \rangle_\phi = \alpha \langle A \rangle_\phi + \beta$,
where $\alpha$ and $\beta$ are arbitrary constants. It is an elementary exercise
to show that this forces $\phi(x) = x$ (up to an affine transformation, which
leaves the average unchanged) and thus singles out
$\langle A \rangle = \sum_i p_i A_i$ as the rule for predicting the result of a
measurement of $A$ given the probabilities $\{p\}$.$^{2}$ The result of an actual
measurement will most likely differ from the prediction, yet this is still the
best we can do with a probability assignment $\{p\}$.
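
The rescaling requirement is easy to probe numerically. The sketch below evaluates a Kolmogorov-Nagumo average for the linear choice $\phi(x) = x$ and, as a contrasting example, for $\phi(x) = e^x$, before and after a change of units and origin. The probabilities, bin values and constants are hypothetical.

```python
from math import exp, log

def kn_average(p, values, phi, phi_inv):
    """Kolmogorov-Nagumo average phi^{-1}( sum_i p_i * phi(A_i) )."""
    return phi_inv(sum(pi * phi(v) for pi, v in zip(p, values)))

def identity(x):
    return x

p = [0.2, 0.5, 0.3]          # hypothetical probabilities
A = [1.0, 2.0, 4.0]          # hypothetical bin values A_i
a, b = 2.5, -1.0             # change of units (a) and of origin (b)
rescaled = [a * v + b for v in A]

# Linear phi: the prediction transforms exactly like the observable.
linear_before = kn_average(p, A, identity, identity)
linear_after = kn_average(p, rescaled, identity, identity)

# Exponential phi: the prediction does not follow the change of units.
exp_before = kn_average(p, A, exp, log)
exp_after = kn_average(p, rescaled, exp, log)

print(linear_after, a * linear_before + b)
print(exp_after, a * exp_before + b)
```

The exponential average still follows a pure shift of origin, but it fails once the unit is changed, which is why both operations are needed to single out the linear rule.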

3.0.4 The Constraint Rule 

Let us first try to make the MaxEnt principle formulation more explicit in its
"using all available information" part. The physical problem under consideration
can be envisioned as the one of studying a set of observables of a system, which
we will refer to as "the primary observables". These could be, e.g., the three
coordinates $x$ of a material point. We seek to assign a probability
distribution $p(x)$ for these observables, which would allow us to a) predict
the results of future measurements of these observables as their expectations
$\langle x \rangle = \int dx\, x\, p(x)$, which is of primary interest, and b)
predict the result of future measurements of any additional observable $Q(x)$ as
$\langle Q \rangle = \int dx\, Q(x)\, p(x)$, which is of secondary interest. In
doing this, we are generally ignorant, except possibly for the results $\{a\}$
of previous measurements of some $m$ observables $A_r$, $r = 1, 2, \ldots, m$.
Then the constraint rule of the MaxEnt principle can be regarded as a
requirement that the assigned probability distribution correctly "predicts" the
results of the already performed measurements as
$a_r = \int dx\, A_r(x)\, p(x)$, $r = 1, 2, \ldots, m$. In other words, the
constraint rule simply forces the probability assignment, which is to be used to
predict the results of future measurements, to be consistent with the results of
measurements already performed. Let us stress that what is involved here are
single measurements and their results $\{a\}$, and not multiple measurements
from which the $a$'s are obtained as sample averages, as is too often implied in
the context of MaxEnt. Indeed, if the results of, say, 10 measurements of, e.g.,
$A_1$ were known as $a_1(i)$, $i = 1, 2, \ldots, 10$, and $a_1$ was taken as
$\bar{a}_1 = \frac{1}{10} \sum_{i=1}^{10} a_1(i)$ to be used as a constraint, this
would be in blatant violation of the "using all available information"
principle, since the set of measured values of $A_1$ clearly contains
information also about $a_1$'s variance,
$\Delta a_1^2 = \frac{1}{10} \sum_{i=1}^{10} [a_1(i) - \bar{a}_1]^2$, and,
similarly, about its higher moments as well.

3.0.5 The Expectation as (sort of) a Parameter 

Before we embark on the problem of assigning probabilities, we need to briefly
discuss the parameterization of our probability distributions in terms of the
expectations of their primary observables. For simplicity we will assume one
primary parameter $x$, the case with multiple such parameters being a
straightforward generalization. In fact, we don't need to consider a full-fledged
parameterization in which the value of the parameter is equal to the expectation
of $x$, but just one that would allow us to independently vary the expectation
of $x$. Thus, we are interested in a parameterization $p(x; x_e)$ such that, for
any $|\epsilon| \ll 1$, we have
$\int dx\, x\, p(x; x_e + \epsilon) = \langle x \rangle + \epsilon + O(\epsilon^3)$,
while the normalization of the probability distribution as well as all other
cumulants $C_n(x)$ of $x$ are preserved

$$\frac{d}{dx_e} \int dx\, p(x; x_e) = \frac{d}{dx_e} \int dx\, C_n(x)\, p(x; x_e) = 0, \qquad n = 2, 3, \ldots$$

We formulate the following Conjecture$^{3}$: A parameterization with the above
properties is only possible if the probability distribution fulfills certain
conditions at the border of its domain, and in this case it is given by
$p(x; x_e) = p(x + x_e)$. Obviously, with such a parameterization we always have
$\frac{\partial p(x; x_e)}{\partial x_e} = \frac{\partial p(x; x_e)}{\partial x}$,
which is the property we are mainly interested in. Having established this, we
can finally address the "most uncommitted" element of the general MaxEnt
principle.
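
The properties claimed for the shift parameterization can be checked numerically on an example. The sketch below uses a unit-variance Gaussian as a stand-in for $p(x)$ (an assumption; the text does not fix the density) and verifies by quadrature and finite differences that shifting leaves the normalization and variance untouched, displaces the expectation by an amount of magnitude $x_e$, and makes the two derivatives coincide.

```python
from math import exp, pi, sqrt

def base_density(x):
    """Hypothetical base density: a unit-variance Gaussian centred at 0."""
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def shifted(x, xe):
    """The conjectured parameterization p(x; x_e) = p(x + x_e)."""
    return base_density(x + xe)

def central_moments(xe, lo=-12.0, hi=12.0, n=24000):
    """Norm, mean and variance of p(x; x_e) by the midpoint rule."""
    h = (hi - lo) / n
    xs = [lo + (j + 0.5) * h for j in range(n)]
    ws = [shifted(x, xe) * h for x in xs]
    norm = sum(ws)
    mean = sum(w * x for w, x in zip(ws, xs)) / norm
    var = sum(w * (x - mean) ** 2 for w, x in zip(ws, xs)) / norm
    return norm, mean, var

def derivative_gap(x, xe, step=1e-5):
    """Difference between dp/dx_e and dp/dx, both by central differences."""
    d_xe = (shifted(x, xe + step) - shifted(x, xe - step)) / (2 * step)
    d_x = (shifted(x + step, xe) - shifted(x - step, xe)) / (2 * step)
    return abs(d_xe - d_x)

norm0, mean0, var0 = central_moments(0.0)
norm1, mean1, var1 = central_moments(0.3)
print(norm1 - norm0, abs(mean1 - mean0), var1 - var0, derivative_gap(0.7, 0.3))
```

A density that does not vanish at the border of a finite domain would lose or gain mass under the shift, which is one way to read the boundary conditions mentioned in the conjecture.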

3.0.6 Assigning Robust Probabilities 

We have shown above that probability assignments based on observations have
inherent indeterminacy due to the arbitrariness of the probability prior.
Therefore, a natural question to ask is whether an assignment exists that is, in
some sense, robust against variations of the prior. As already demonstrated, the
latter cause multiplicative noise in the probabilities. Hence we try to
formulate a robustness requirement in terms of a probability distance of the
Ali-Silvey type, $D(p; p + \delta p) \to \min$, where $\delta p$ is the
probability noise. As is well known, for normalization-preserving $\delta p(x)$

$$D(p; p + \delta p) = \frac{a}{2} \int dx\, p(x) \left( \frac{\delta p(x)}{p(x)} \right)^2 + O(\delta p^3)$$

where the constant coefficient $a \sim 1$ depends on the particular distance
used. With a general multiplicative $\delta p(x) = \epsilon(x) p(x)$ the
norm-preserving variation of this with respect to $p(x)$ does not produce a
solution. Hence, for the most general probability noise our robustness
requirement is not selective enough to single out a particular distribution.
However, upon some reflection, we realize that not all possible perturbations in
the distribution are of equal importance: we are mainly interested in the
robustness of the probabilities with regard to the perturbations which would
have maximal effect on the primary observables, that is, we choose the
multiplicative noise such that
$\delta p(x) = \epsilon(x) p(x) = \epsilon'(x) \cdot \frac{\partial p(x; x_e)}{\partial x_e}$.
With this noise

$$D(p; p + \delta p) = \frac{a}{2} \int dx\, p^{-1}(x; x_e)\, \epsilon'(x) \cdot \frac{\partial p(x; x_e)}{\partial x_e} \frac{\partial p(x; x_e)}{\partial x_e} \cdot \epsilon'(x) + O(\epsilon'^3)
\le \frac{a}{2} \int dx\, \epsilon'^2(x)\, \mathrm{Tr}\, I_F(x_e)$$
3 The space restrictions do not allow us to formulate this as a theorem here. 






where $I_F(x_e) = \int dx\, p^{-1}(x; x_e)\, \frac{\partial p(x; x_e)}{\partial x_e} \frac{\partial p(x; x_e)}{\partial x_e}$
is the Fisher information matrix with respect to $x_e$ and the inequality
follows from its positive definiteness. Hence, for an arbitrary noise factor
$\epsilon'(x)$ the tightest bound on the distance results from the distribution
with minimal trace of $I_F(x_e)$. Using the interchangeability of the
derivatives derived above, we arrive at the final form of the robustness
condition, where $x_e$ does not play a role any more and is therefore dropped:

$$\int dx\, \frac{1}{p(x)}\, \frac{\partial p(x)}{\partial x} \cdot \frac{\partial p(x)}{\partial x} \to \min$$



When results of measurements of some observables are known, the above
minimization is constrained such that the resulting probabilities reproduce
these observables. Is there any sense in which the distribution so characterized
could be considered "the least committed"? The Cramér-Rao result for the most
efficient estimator of $x_e$ in the form
$\mathrm{Tr}\,[\mathrm{cov}^{-1}(x_e)] = \mathrm{Tr}\, I_F(x_e)$ indicates that in
the situation where $p(x)$ is the one with minimal trace of $I_F(x_e)$ an
invariant measure of the magnitude of the primary observables' covariance is
maximal. This can be formulated as "The distribution with minimal trace of the
Fisher information is the one for which the most efficient estimator of the
primary observables (whether it actually exists or not) has the worst possible
performance". Thus the extremal property of $p(x)$ can indeed be interpreted as
the distribution being "the least committed" with regard to the primary
observables.
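
As a rough numerical illustration of the functional being minimized, the sketch below estimates $\int dx\, p'(x)^2 / p(x)$ for two one-dimensional densities of equal (unit) variance. The choice of a Gaussian and a Laplace density, the integration range and the grid are assumptions of this example, not part of the text.

```python
from math import exp, pi, sqrt

def fisher_information(density, lo, hi, cells):
    """Midpoint-rule estimate of the integral of p'(x)^2 / p(x) over [lo, hi].

    The derivative is a central difference across each cell, so a kink that
    sits exactly on a cell boundary (as for the Laplace density) is harmless.
    """
    h = (hi - lo) / cells
    total = 0.0
    for j in range(cells):
        left = lo + j * h
        mid = left + 0.5 * h
        slope = (density(left + h) - density(left)) / h
        total += slope * slope / density(mid) * h
    return total

def gaussian(x):
    """Unit-variance Gaussian."""
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def laplace(x):
    """Laplace density with scale 1/sqrt(2), i.e. also unit variance."""
    scale = 1.0 / sqrt(2.0)
    return exp(-abs(x) / scale) / (2.0 * scale)

info_gauss = fisher_information(gaussian, -16.0, 16.0, 32768)
info_laplace = fisher_information(laplace, -16.0, 16.0, 32768)
print(info_gauss, info_laplace)
```

With the variance as the only constraint, the Gaussian comes out with the smaller value (close to 1 against close to 2), consistent with its being the least committed of the two in the sense described above.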

4 Discussion 

Previously [5] we have derived the same condition (in the one-dimensional case)
for assigning uninformative probabilities from the requirement that they be the
least sensitive to coarse-graining. The rationale for this requirement was that
coarse-graining decreases the "information content" - if such a thing could be
meaningfully defined - and the distribution with minimal information content to
start with would be the one least affected by it. The approach is, in a sense,
complementary to Bernardo's reference priors, where information is gained and
the effect of this gain is maximized. However, in contrast to Bernardo's, our
result does not depend on which particular distance is used to measure the
sensitivity of the probabilities. That the same assignment rule would result
from the present, quite different, considerations may bear some yet unidentified
significance. Fisher information-like constructs appear almost universally in
physics [6] and one cannot help but wonder to what extent physical laws could be
explained as information processing rules, and so answer Toffoli's question
"Where does Nature shop for its Lagrangians".

Of significant interest is also whether, and how, the same considerations apply
to probabilities on discrete domains. In physics, discrete domains are usually
obtained by coarse-graining of continuous ones, and thus are "loaded" with
properties inherited from the topology and the metric of the original continuum.
Such remnants could be, for example, various nearest, second nearest, etc.
neighbour hierarchies. Choosing the simplest case of a coarse-grained segment of
the real line, the discrete domain is $\{1, 2, \ldots, n\}$ and the relevant
observable is $\langle i \rangle = \mathrm{nint}\big(\sum_{i=1}^{n} i\, p_i\big)$,
where the function nint returns the nearest integer to its argument. It can be
conjectured, as in the continuous case, that the only possible way to perturb
the probabilities while best preserving their normalization and the higher
cumulants of $i$, again subject to certain conditions on $p_1$ and $p_n$, is
equivalent to successive application of $p'_i = p_i - \epsilon p_i$,
$p'_{i+1} = p_{i+1} + \epsilon p_i$, where the multiplicativity of the noise is
explicitly used. Then

$$D(p; p') = \frac{a}{2}\, \epsilon^2 \sum_{i=1}^{n} p_i^2 \left( \frac{1}{p_i} + \frac{1}{p_{i+1}} \right) + O(\epsilon^3)
= \frac{a}{2}\, \epsilon^2 \left( 1 + \sum_{i=1}^{n} \frac{p_i^2}{p_{i+1}} \right) + O(\epsilon^3).$$



The maximal robustness with respect to $\langle i \rangle$ is achieved for a
distribution for which the distance is minimal, hence

$$\sum_{i=1}^{n} \frac{p_i^2}{p_{i+1}} \to \min$$

Here the role of the Fisher information is played by the Rényi distance of
order 2.
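
The discrete expression can be reproduced with a few lines of code. The sketch below applies each elementary transfer separately, measures it with a Pearson-type distance (which corresponds to $a = 2$), and compares the total with $\epsilon^2 (1 + \sum_i p_i^2 / p_{i+1})$. Treating the last bin's neighbour cyclically is an assumption made here only to keep every term defined; the text instead imposes unspecified conditions on $p_1$ and $p_n$.

```python
def elementary_move(p, i, eps):
    """Transfer eps * p_i from bin i to bin i+1 (cyclic in this sketch)."""
    q = list(p)
    j = (i + 1) % len(p)
    q[i] -= eps * p[i]
    q[j] += eps * p[i]
    return q

def chi_square(p, q):
    """Pearson-type distance sum_j (q_j - p_j)^2 / p_j."""
    return sum((qj - pj) ** 2 / pj for pj, qj in zip(p, q))

def total_sensitivity(p, eps):
    """Sum of the distances produced by every elementary move."""
    return sum(chi_square(p, elementary_move(p, i, eps)) for i in range(len(p)))

def shift_functional(p):
    """sum_i p_i^2 / p_{i+1}, again with a cyclic neighbour."""
    n = len(p)
    return sum(p[i] ** 2 / p[(i + 1) % n] for i in range(n))

eps = 1e-3
uniform = [0.25] * 4
skewed = [0.4, 0.3, 0.2, 0.1]   # hypothetical non-uniform assignment
print(total_sensitivity(skewed, eps), eps ** 2 * (1 + shift_functional(skewed)))
print(shift_functional(uniform), shift_functional(skewed))
```

Under the cyclic convention the functional equals 1 for the uniform assignment and exceeds 1 otherwise; with other boundary conditions or additional constraints the minimizer will in general differ.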